Issues in developing LVCSR System for Dravidian Languages: An exhaustive case study for Tamil
Abstract
Research on Large Vocabulary Continuous Speech Recognition (LVCSR) for Indian languages has not advanced as far as it has for English, because large-scale speech and language corpora are still scarce. Tamil is one of the four major Dravidian languages spoken in southern India. It is morphologically very rich, which poses a great challenge for developing LVCSR systems. In this paper, we analyze a Tamil corpus of 10 million words and present the results of a type-token analysis that indicates the morphological richness of Tamil. We demonstrate a grapheme-to-phoneme (G2P) mapping system for Tamil that achieves an accuracy of 99.56%. We show the impact of important parameters such as absolute beam width, language weight, number of Gaussians and number of senones on speech recognition accuracy for a limited vocabulary (3k). We present the results of a large open-vocabulary, speaker-independent speech recognition task for vocabulary sizes of 30k, 60k and 100k. The Out Of Vocabulary (OOV) rates are 20.2%, 15.8% and 12.8% respectively, and the accuracies are 43.59%, 47.11% and 43.52% respectively.
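The two corpus statistics named in the abstract, the type-token analysis and the OOV rate, can be illustrated with a minimal Python sketch. This is not the authors' tooling: the function names (`type_token_stats`, `top_k_vocabulary`, `oov_rate`) are hypothetical, and the sketch assumes whitespace-tokenized text and a vocabulary built from the most frequent training words.

```python
from collections import Counter


def type_token_stats(tokens):
    """Return (number of types, number of tokens, type/token ratio)."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, ratio


def top_k_vocabulary(train_tokens, k):
    """Assumed vocabulary selection: the k most frequent training words."""
    return {word for word, _ in Counter(train_tokens).most_common(k)}


def oov_rate(test_tokens, vocabulary):
    """Percentage of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * missing / len(test_tokens)
```

In a morphologically rich language, the number of types keeps growing quickly as more tokens are read, so a fixed-size vocabulary (for example 30k, 60k or 100k words) leaves a larger share of test tokens uncovered than it would for English.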
Similar resources
Automatic continuous speech recogniser for Dravidian languages using the auto associative neural network
In recent times, with the extensive improvement of computers, numerous methods of data interchange between humans and computers have emerged. They aim to provide an efficient way for humans to communicate with computers, especially for people with disabilities who face a variety of obstacles while using computers. This paper predominantly focuses on developing an efficient speech recognition system fo...
Tamil NER – Coping with Real Time Challenges
This paper describes various challenges encountered while developing an automatic Named Entity Recognition (NER) using Conditional Random Fields (CRFs) for Tamil. We also discuss how we have overcome some of these challenges. Though most of the challenges in NER discussed here are common to many Indian languages, in this work the focus is on Tamil, a South Indian language belonging to Dravidian...
Development of Telugu-Tamil Transfer-Based Machine Translation system: With Special reference to Divergence Index
The existence of translation divergence precludes straightforward mapping in machine translation (MT) system. An increase in the number of divergences also increases the complexity, especially in linguistically motivated transfer-based MT systems. In other words, divergence is directly proportional to the complexity of MT. Here we propose a divergence index (DI) to quantify the number of parame...
A Generic Anaphora Resolution Engine for Indian Languages
In this paper, we present a generic anaphora engine for Indian languages, which are mostly resource-poor languages. We have analysed the similarities and variations between pronouns and their agreement with antecedents in Indian languages. The generic algorithm developed uses the morphological richness of Indian languages. The machine learning approach uses the features which can handle major...
Multilingual and Crosslingual Speech Recognition
This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Tu...